Skip to content

fix(sast): harden vulnerability redaction to deny-by-default (PR #47 review)#48

Merged
pureliture merged 2 commits into
mainfrom
claude/vuln-redaction-hardening
Jun 20, 2026
Merged

fix(sast): harden vulnerability redaction to deny-by-default (PR #47 review)#48
pureliture merged 2 commits into
mainfrom
claude/vuln-redaction-hardening

Conversation

@pureliture

Copy link
Copy Markdown
Contributor

Outcome

#47 머지 후 리뷰에서 발견된 공개안전 누출을 후속으로 수정한다. 중앙 vulnerability redaction이 allow-by-default라 raw source/secret/path/host가 VULN_FINDING JSONL과 M4 LLM 프롬프트로 새고(blocker A1·A2), 동시에 operator/SQL 백스톱이 정상 rule prose를 파괴했다. 멀티에이전트 리뷰(5차원) + adversarial red-team(4 angle)으로 양방향(누출 + 과검열)을 라이브 재현해 닫았다.

무엇을

  • redaction.pydeny-by-default 민감-클래스 redaction으로 재설계:
    • NFKC + 유니코드 homoglyph(U+2044/2215/2550/00B7 등) + %2F/%5C 분리자 정규화(우회 차단).
    • secret 패턴 확장: provider prefix(sk_/AIza/github_pat/glpat/npm/xox/gh[opsu]_) + 고엔트로피 bare 토큰 + PEM.
    • host TLD 광역화(.cloud/.xyz/.app…).
    • prose 파괴 백스톱 제거 → 정밀 SQL-statement(SELECT…FROM만) + 좁은 identifier-assignment(RHS가 식별자일 때만, 숫자/비교 제외).
    • backslash + bare-source-filename 경로.
  • sarif.py — owasp_tags를 A01..A10 토큰으로 정규화, persisted properties에서 raw tags 제거(A3).
  • model.pyverifier_verdict의 reason/remediation/error를 read 시 재검열(B1 방어).
  • verifier.py — verdict reason/remediation/error를 중앙 redaction 경유(B1).
  • runner.py — semgrep stderr가 중앙 redaction 재사용(B2).

닫힌 것

리뷰 findings A1·A2(blocker), A3·A4·B1·B2, 그리고 red-team의 16개 novel bypass(bare provider secret, 고엔트로피 토큰, homoglyph/URL-encoded 분리자, 광역 host, verifier error 필드). 동시에 red-team이 지적한 11개 over-redaction(update/Delete/==/len <= 255/SELECT query/std::string/data-flow ->/timeout = 30/실제 Semgrep rule id)이 모두 보존됨.

검증

  • tests/test_vulnerability_redaction.py47개 회귀(모든 attack 닫힘 + prose 보존을 양방향 고정).
  • uv run pytest 830 passed(secret-scanner 무회귀); public_safety/render --validate/--check/validate_ledger/rebuild_ledger_index --check 전부 green; ruff clean.
  • 테스트의 secret-shaped fixture는 fragment 조립이라 소스에 연속 secret 패턴 없음.

🤖 Generated with Claude Code

…review)

PR #47 review + adversarial red-team found the central vulnerability redaction
was allow-by-default, leaking raw source/secret/path/host into the VULN_FINDING
JSONL and the M4 LLM prompt, while an operator backstop also over-redacted
legitimate rule prose. Rework to robust sensitive-class redaction:

- redaction.py: NFKC + unicode-homoglyph + %2F/%5C separator folding; broadened
  secret patterns (provider prefixes sk_/AIza/github_pat/glpat/npm, high-entropy
  bare token, PEM); broadened host TLDs; precise SQL-statement and narrow
  identifier-assignment redaction instead of a prose-destroying operator/keyword
  backstop; backslash + bare-source-filename paths.
- sarif.py: owasp_tags normalized to A01..A10 tokens; raw 'tags' dropped from
  persisted properties.
- model.py: verifier_verdict reason/remediation/error re-sanitized on read.
- verifier.py: verdict reason/remediation/error routed through central redaction.
- runner.py: semgrep stderr reuses central redaction.
- tests: 47 redaction regressions locking every review + red-team attack closed,
  and asserting legitimate rule prose/ids survive (no over-redaction).

Closes review findings A1/A2 (blocker), A3/A4/B1/B2 and 16 red-team bypasses.
uv run pytest 830 passed; public_safety/render/ledger green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request hardens the vulnerability metadata redaction mechanism by introducing homoglyph normalization, percent-encoding handling, and comprehensive regex patterns to prevent leaks of secrets, paths, hosts, and other sensitive information. It also routes verifier verdicts and process details through this central redaction system and adds a robust test suite. The review feedback correctly identifies that several regex patterns (_HOST_RE, _SRC_FILE_RE, and _IDENTIFIER_PATH_RE) lack case-insensitivity, which could allow uppercase or mixed-case hostnames and file extensions to bypass redaction. It is recommended to apply the suggested (?i) flags and expand the test cases accordingly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +79 to 85
_HOST_RE = re.compile(
r"(?<![\w.@-])(?:[A-Za-z0-9-]+\.)+"
r"(?:com|net|org|io|dev|ai|co|cloud|app|xyz|us|uk|ca|de|fr|jp|cn|ru|eu|me|"
r"tv|cc|sh|info|biz|site|online|tech|store|zip|mov|svc|k8s|"
r"internal|local|example|corp|gov|edu|test|lan|intra|prod|stg|qa)\b"
r"(?::\d{2,5})?"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The regular expression _HOST_RE is compiled without case-insensitivity. Consequently, hostnames with uppercase TLDs or domains (e.g., evil.internal.EXAMPLE, DB.INTERNAL, EXAMPLE.COM) will bypass host redaction. Adding the (?i) inline flag at the beginning of the pattern ensures that all host variations are correctly redacted.

Suggested change
_HOST_RE = re.compile(
r"(?<![\w.@-])(?:[A-Za-z0-9-]+\.)+"
r"(?:com|net|org|io|dev|ai|co|cloud|app|xyz|us|uk|ca|de|fr|jp|cn|ru|eu|me|"
r"tv|cc|sh|info|biz|site|online|tech|store|zip|mov|svc|k8s|"
r"internal|local|example|corp|gov|edu|test|lan|intra|prod|stg|qa)\b"
r"(?::\d{2,5})?"
)
_HOST_RE = re.compile(
r"(?i)(?<![\w.@-])(?:[A-Za-z0-9-]+\.)+"
r"(?:com|net|org|io|dev|ai|co|cloud|app|xyz|us|uk|ca|de|fr|jp|cn|ru|eu|me|"
r"tv|cc|sh|info|biz|site|online|tech|store|zip|mov|svc|k8s|"
r"internal|local|example|corp|gov|edu|test|lan|intra|prod|stg|qa)\b"
r"(?::\d{2,5})?"
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

반영했습니다 (f83738f) — _HOST_RE(?i)를 적용해 대문자/혼합 host TLD(예: DB.PROD.ACME.CLOUD) 우회를 닫고, 대문자 host 회귀 테스트를 추가했습니다.

Comment on lines 103 to 107
_SRC_FILE_RE = re.compile(rf"(?<![/\\\w.-])[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b")
_IDENTIFIER_PATH_RE = re.compile(
rf"(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){{1,}}[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b"
r"|(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){2,}[A-Za-z0-9_.-]+"
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The regular expressions _SRC_FILE_RE and _IDENTIFIER_PATH_RE are compiled without case-insensitivity. As a result, they only match lowercase file extensions (e.g., .py, .java, .cpp). If a SARIF file or source code reference contains uppercase or mixed-case extensions (e.g., App.Java, main.CPP, config.PY), they will completely bypass the redaction. Adding the (?i) inline flag at the beginning of the patterns ensures that all file extension variations are correctly redacted.

Suggested change
_SRC_FILE_RE = re.compile(rf"(?<![/\\\w.-])[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b")
_IDENTIFIER_PATH_RE = re.compile(
rf"(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){{1,}}[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b"
r"|(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){2,}[A-Za-z0-9_.-]+"
)
_SRC_FILE_RE = re.compile(rf"(?i)(?<![/\\\w.-])[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b")
_IDENTIFIER_PATH_RE = re.compile(
rf"(?i)(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){1,}[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b"
r"|(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){2,}[A-Za-z0-9_.-]+"
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

반영했습니다 (f83738f) — _SRC_FILE_RE/_IDENTIFIER_PATH_RE(?i)를 적용해 대문자 확장자(예: creds.JSON) 우회를 닫았습니다.

Comment on lines +47 to +56
@pytest.mark.parametrize(
"text, leak",
[
("See src\\internal\\creds.py for the leak", "src\\internal\\creds.py"),
("leak at back\\slash\\rel.py", "back\\slash\\rel.py"),
("the offending file is config.py here", "config.py"),
],
)
def test_a2_backslash_and_bare_paths_are_redacted(text, leak):
assert leak not in sanitize_vulnerability_text(text)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add test cases with uppercase and mixed-case file extensions to verify that case-insensitive file path redaction works correctly and to prevent future regressions.

Suggested change
@pytest.mark.parametrize(
"text, leak",
[
("See src\\internal\\creds.py for the leak", "src\\internal\\creds.py"),
("leak at back\\slash\\rel.py", "back\\slash\\rel.py"),
("the offending file is config.py here", "config.py"),
],
)
def test_a2_backslash_and_bare_paths_are_redacted(text, leak):
assert leak not in sanitize_vulnerability_text(text)
# --- A2: backslash (Windows) relative paths + bare source filenames ---
@pytest.mark.parametrize(
"text, leak",
[
("See src\\internal\\creds.py for the leak", "src\\internal\\creds.py"),
("leak at back\\slash\\rel.py", "back\\slash\\rel.py"),
("the offending file is config.py here", "config.py"),
("See src\\internal\\creds.PY for the leak", "src\\internal\\creds.PY"),
("the offending file is config.Py here", "config.Py"),
],
)
def test_a2_backslash_and_bare_paths_are_redacted(text, leak):
assert leak.lower() not in sanitize_vulnerability_text(text).lower()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

반영했습니다 (f83738f) — test_case_insensitive_paths_and_hosts_redacted에 대문자/혼합 확장자(Creds.JSON, App.PY, Creds.PY) 케이스를 추가했습니다.

Comment on lines +60 to +74
@pytest.mark.parametrize(
"text, leak",
[
(
"Reported by john.doe@internal-corp.example",
"john.doe@internal-corp.example",
),
("callback to 10.1.2.3 observed", "10.1.2.3"),
("host evil.internal.example reached", "evil.internal.example"),
("leak at src/secrets/master.key now", "secrets/master.key"),
("connect fe80::1ff:fe23:4567:890a here", "fe80::1ff:fe23:4567:890a"),
],
)
def test_a4_network_pii_and_unicode_paths_are_redacted(text, leak):
assert leak not in sanitize_vulnerability_text(text)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add test cases with uppercase hostnames to verify that case-insensitive host redaction works correctly and to prevent future regressions.

Suggested change
@pytest.mark.parametrize(
"text, leak",
[
(
"Reported by john.doe@internal-corp.example",
"john.doe@internal-corp.example",
),
("callback to 10.1.2.3 observed", "10.1.2.3"),
("host evil.internal.example reached", "evil.internal.example"),
("leak at src/secrets/master.key now", "secrets/master.key"),
("connect fe80::1ff:fe23:4567:890a here", "fe80::1ff:fe23:4567:890a"),
],
)
def test_a4_network_pii_and_unicode_paths_are_redacted(text, leak):
assert leak not in sanitize_vulnerability_text(text)
# --- A4: email / IP / host / unicode-separator paths ---
@pytest.mark.parametrize(
"text, leak",
[
(
"Reported by john.doe@internal-corp.example",
"john.doe@internal-corp.example",
),
("callback to 10.1.2.3 observed", "10.1.2.3"),
("host evil.internal.example reached", "evil.internal.example"),
("host EVIL.INTERNAL.EXAMPLE reached", "EVIL.INTERNAL.EXAMPLE"),
("leak at src/secrets/master.key now", "secrets/master.key"),
("connect fe80::1ff:fe23:4567:890a here", "fe80::1ff:fe23:4567:890a"),
],
)
def test_a4_network_pii_and_unicode_paths_are_redacted(text, leak):
assert leak.lower() not in sanitize_vulnerability_text(text).lower()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

반영했습니다 (f83738f) — 같은 파라메트라이즈에 대문자/혼합 host(DB.PROD.ACME.CLOUD, Db.Prod.Acme.Cloud) 케이스를 추가했습니다.

gemini review flagged _HOST_RE TLD allowlist and _SRC_FILE_RE /
_IDENTIFIER_PATH_RE source-extension allowlist were case-sensitive, so
uppercase/mixed-case hosts (DB.PROD.ACME.CLOUD) and file extensions
(creds.JSON) bypassed redaction. Add (?i) and uppercase/mixed-case
regression tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pureliture pureliture merged commit 704f344 into main Jun 20, 2026
9 checks passed
@pureliture pureliture deleted the claude/vuln-redaction-hardening branch June 20, 2026 02:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant